IMT573 Problem Set 5: Bayes theorem, distributions Your name: Kulraj Singh kohli Deadline: Wed, Nov 6th 3pm Instructions 1. You are welcome to answer some of the questions on paper but please include the result as an image in your final file (a cellphone photo will do). 2. Please write clearly. Imagine this is a business report for your boss. She does not care about coding but is very much interested in the results. Can she understand your text? Test it, by making all code invisible and ensuring one can still understand what you have written.
1 Overbooking flights
You are hired by Air Nowhere to recommend the optimal overbooking rate. It is a small airline that uses a 100-seat plane to carry you from Seattle to, well, nowhere. The tickets cost $100 each, so a fully booked plane generates $10,000 revenue. The sales team has found that the probability, that the passengers who have paid their fare actually show up is 98%, and individual show-ups can be considered independent. The additional costs, associated with finding an alternative solutions for passengers who are refused boarding are $500 per person.
Hint: read OIS ch 3 about distributions.
Ans1.Binomial
Expected revenue is 10,000
#binomial for the question
prob_101<-dbinom(101,101,0.98)
prob_101
## [1] 0.1299672
Probability is 12.9% 4. What are the expected profits (= revenue − expected additional costs) in this case? Would you recommend overbooking over selling just the right number of tickets?
#expected profits = revenue − expected additional costs
expected_profits<-10100-(prob_101)*500*1
Expected Profit is 10035.02 As you can see that overbooking is still going to generate higher expected profits hence overbooking is recommended.
#binomial for the question
prob_102<-dbinom(102,102,0.98)
The probability that all 102 passengers show up is 12.7% 6. What is the probability that 101 passengers still one too many will show up?
#binomial for the question
dbinom(101,102,0.98)
## [1] 0.265133
Probability that 101 passengers show up is 26.5%
#expected profits = revenue − expected additional costs
expected_profit_102<-10200-(2*dbinom(102,102,0.98)+dbinom(101,102,0.98))*500
expected_profit_102
## [1] 9940.066
We see that the expected profit is $9940.066 which is less than the expected profit of 10035 in the case of 101 tickets.
#Writing a function to calculate the expected revenue till 108 and analysing trends
n<-101
final<-c()
while (n<109) {
x<-0
sum<-10000+(n-100)*100
expected_loss<-0
while (x<(n-100)) {
expected_loss<-expected_loss+((n-100)-x)*dbinom(n-x,n,0.98)
x<-x+1
}
tot_expected_loss<-expected_loss*500
final[n-100]<-sum-tot_expected_loss
n<-n+1
}
final
## [1] 10035.016 9940.066 9713.848 9398.369 9036.920 8656.348 8269.083
## [8] 7879.791
We see from the above trend that the most optimal scenario is when the airline sells 101 tickets to customers.
One passenger coming does not affect whether the other passenger is coming or not,means that they are independent events.
It is important because while calculating probability if the events were interdependent we cant use binomial distribution in that case.
Hint: read about independence in OIS 2.1.6 (2017 version). Hint: some of the expressions may be hard to write analytically. Feel free to use computer for the calculations, just show the code and explain what you are doing.
2 Does the student know the answer?
In the exam, there is a multiple-choice question with four (mutually exclusive) options. In average, 80% of the students know the answer, but event those who know, still answer it wrong in 10% of time because of the exam stress.
ANSWER 2
2 3 Histogram and normal distribution
In this problem set you are comparing the humans in terms of body size (height) and influence. Strictly speaking we do not have data on human influence here, just research paper influence (citations) but it is a good proxy for influence of the individual humans (researchers).
Let’s start with the human height data.
citations<-read.csv("mag-in-citations.csv")
head(citations)
## paper_id num_in_citations
## 1 4090687 2
## 2 6537979 2
## 3 7484482 4
## 4 9444380 3
## 5 14056478 5
## 6 14498457 2
human_height<-read.delim("fatherson.csv.bz2")
head(human_height)
## fheight sheight
## 1 165.2 151.8
## 2 160.7 160.6
## 3 165.0 160.9
## 4 167.0 159.5
## 5 155.3 163.3
## 6 160.1 163.2
We see from the data that fheight and sheight have a true 0 value which is also well defined, hence it is a ratio.
It should be measured as continuous and positive.
length(human_height$fheight)
## [1] 1078
table(is.na(human_height$fheight))
##
## FALSE
## 1078
We see from the above results that we have 1078 entries and none of them are missing.
Hint: there is no built-in method to estimate mode in R. Several packages provide a way to do it, I recommend to use modeest::mlv (installedon the server).
However, as height is a continuous variable, there are many ways to compute it. Take a look at the corresponding documentation. You may experiment with a few options and pick one, for instance naive.
summary(human_height$fheight)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 149.9 167.1 172.1 171.9 176.8 191.6
getmode <- function(v) {
uniqv <- unique(v)
uniqv[which.max(tabulate(match(v, uniqv)))]
}
getmode(human_height$fheight)
## [1] 175.4
sd(human_height$fheight)
## [1] 6.972346
The range of the human heights is 49 (149.9-191.9) , Std Deviation is 6.97 , mode is 175.4 , median is 172.1, mean is 171.9. (I assume these values to be in cm)
Mean is the same as median whereas mode is greater than mean.
The std deviation is less, means that most of our data lies around our mean.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(ggplot2)
plot1<-human_height %>% ggplot(aes(x = fheight)) + geom_histogram(binwidth =0.5,color="black",fill="white", alpha = 0.4)
plot1
plot2<-human_height %>% ggplot(aes(x = fheight)) + geom_histogram(aes(y=..density..),binwidth=0.5,color="black",fill="white",alpha =0.6)+stat_function(fun = dnorm ,color = "red",args=list(mean=171.9, sd= 6.97))+ scale_x_continuous(name = "Distribution of Height ")+ggtitle("Plot of Normal distribution with same Mean and Std Deviation vs Density plot of actual data")
plot2
plot2 + geom_density(color="black") + geom_vline(aes(xintercept=mean(human_height$fheight)),linetype="dashed",color="magenta1")+ geom_text(aes(x=mean(human_height$fheight), label="Mean", y=0.01,), colour="magenta1",alpha = 0.01, angle=90, vjust =-0.5, text=element_text(size=50)) + geom_vline(aes(xintercept=median(human_height$fheight)),linetype="dashed",color="purple1")+ geom_text(aes(x=median(human_height$fheight), label="Meidan", y=0.03,), colour="purple1",alpha = 0.01, angle=90, vjust =1.5, text=element_text(size=20)) +
geom_vline(aes(xintercept=getmode(human_height$fheight)),linetype="dashed",color="midnightblue")+ geom_text(aes(x=getmode(human_height$fheight), label="Mode", y=0.03,), colour="midnightblue",alpha = 0.01, angle=90, vjust =1.5, text=element_text(size=20))
## Warning: Ignoring unknown parameters: text
## Warning: Ignoring unknown parameters: text
## Warning: Ignoring unknown parameters: text
What do you find? Are the histogram and the density plot similar?
We see from the graph the red line represents our normal distribution and the black line is the density plot for the given data of heights. They are pretty close to each other and hence we can say they are similar.
Next, let’s take a look at human influence. We are working with the number of citations of research papers.
head(citations)
## paper_id num_in_citations
## 1 4090687 2
## 2 6537979 2
## 3 7484482 4
## 4 9444380 3
## 5 14056478 5
## 6 14498457 2
We see that the citations column are discrete positive values and the have a well defined 0 hence are a ratio measure.
length(citations$num_in_citations)
## [1] 388258
table(is.na(citations$num_in_citations))
##
## FALSE
## 388258
We see that there are 388258 citations and none of them are missings
summary(citations$num_in_citations)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.00 1.00 3.00 15.61 12.00 18682.00
getmode(citations$num_in_citations)
## [1] 0
sd(citations$num_in_citations)
## [1] 78.39079
The range of the citations is 18682 (149.9-191.9cm) , Std Deviation is 78.390 , mode is 0 , median is 3, mean is 15.61.
Mean is not the same as median and median is smaller than mean whereas mode is lesser than median.
The std deviation is a large number, means that most of our data does not lie around our mean.
citplot1<-citations %>% ggplot(aes(x = num_in_citations)) + geom_histogram(color="black",fill="white", alpha = 0.4)+ scale_x_log10()+scale_y_log10()
citplot1
## Warning: Transformation introduced infinite values in continuous x-axis
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 84550 rows containing non-finite values (stat_bin).
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 1 rows containing missing values (geom_bar).
citplot2<-ggplot(citations, aes(x = num_in_citations)) + geom_histogram(fill = "magenta3",alpha=0.2, color= "black") + scale_x_log10() + scale_y_log10() + ggtitle("Histogram of number of citations with Normal distribution")+ geom_density(aes(y=0.2*..count..), colour="black", adjust=4)
citplot2
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Transformation introduced infinite values in continuous x-axis
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 84550 rows containing non-finite values (stat_bin).
## Warning: Removed 84550 rows containing non-finite values (stat_density).
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 1 rows containing missing values (geom_bar).
citplot2 + geom_vline(aes(xintercept=mean(citations$num_in_citations)),linetype="dashed",color="orange1")+ geom_text(aes(x=mean(citations$num_in_citations), label="Mean", y=10,), colour="orange1",alpha = 0.01, angle=90, vjust =-0.5, text=element_text(size=50)) + geom_vline(aes(xintercept=median(citations$num_in_citations)),linetype="dashed",color="purple1")+ geom_text(aes(x=median(citations$num_in_citations), label="Meidan", y=100,), colour="purple1",alpha = 0.01, angle=90, vjust =1.5, text=element_text(size=20)) +
geom_vline(aes(xintercept=getmode(citations$num_in_citations)),linetype="dashed",color="midnightblue")+ geom_text(aes(x=getmode(citations$num_in_citations), label="Mode", y=100,), colour="midnightblue",alpha = 0.01, angle=90, vjust =1.5, text=element_text(size=20))
## Warning: Ignoring unknown parameters: text
## Warning: Ignoring unknown parameters: text
## Warning: Ignoring unknown parameters: text
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Transformation introduced infinite values in continuous x-axis
## Warning: Transformation introduced infinite values in continuous x-axis
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 84550 rows containing non-finite values (stat_bin).
## Warning: Removed 84550 rows containing non-finite values (stat_density).
## Warning: Transformation introduced infinite values in continuous y-axis
## Warning: Removed 1 rows containing missing values (geom_bar).
What do you find? Are the histogram and the density plot similar? Hint: experiment with log-log scale for the histogram. 9. Finally, comment on your findings about human bodies and influence. ->We see that the human height follows a curve similar to that of normal distribution. ->We see that most number of citations are 0 -> We see that the data of citations is right skewed. ->We see that the median and mode of the human data is almost the same,whereas the median is smaller than the mean in citations data. -> We see that humans have an normal distributin of height, which means that the have a common median and mode most of the heights are at this value and the std deviation is less that means the data does not vary too much -> We see that most people have zero or very less citations and that is the reason why its histogram is right skewed, the std deviation is large that means the data is scattered and there are large outliers.